I am excited about this course and eager to learn more about the statistics behind machine learning and the different algorithms. I heard about this course when I took a couple of summer courses that Kimmo taught.
This is my GitHub repository: link
I have taken some courses where regression has been a topic before, so there was not much new here. However, it is always good to revise.
This week was trickier than last week. I had very little time to work on the tasks and had to skip some of the theory. I will catch up on it at the end of this week.
imported_file <- read.csv(file = "learning2014.csv")
head(imported_file)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
str(imported_file)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 166 rows (observations) and 7 columns (variables). It comes from a survey on approaches to learning, conducted in 2014 at the University of Helsinki. The number of students who responded was 183.
pairs(imported_file[-1])
summary(imported_file)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
par(mfrow=c(2,3))
for(i in 2:7) {
boxplot(imported_file[,i], main=names(imported_file)[i])
}
Concerning the distribution of the variables: age is quite concentrated between 20 and 30 but has a fair number of older outliers. Attitude and stra have even distributions centered around 30 and 3 respectively, deep has some outliers with low scores, and surf is quite evenly distributed but has one outlier with a higher value. Points is somewhat skewed towards higher values, but with quite many low values too. Concerning the relationships between variables: points and attitude seem to be connected, but relationships between the other variables are difficult to see from the scatterplot matrix.
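The outliers that the boxplots flag can also be listed numerically with `boxplot.stats()`, which applies the same 1.5 * IQR whisker rule as `boxplot()`. A minimal sketch on a made-up age vector (not the course data):

```r
# Hypothetical ages: a young bulk plus a few older respondents
ages <- c(rep(21, 10), rep(25, 10), 49, 53, 55)
# $out holds the values plotted as individual points beyond the whiskers
boxplot.stats(ages)$out
```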
fit <- lm(points~attitude + age + stra,data=imported_file)
summary(fit)
##
## Call:
## lm(formula = points ~ attitude + age + stra, data = imported_file)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
fit2 <- lm(points~attitude,data=imported_file)
summary(fit2)
##
## Call:
## lm(formula = points ~ attitude, data = imported_file)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
My result, not surprisingly given what the scatterplots looked like, is that only attitude has a significant relationship with points.
The p-value is approx. 4.12e-09, well below a 0.1% chance of getting this data if there were no relationship between attitude and points. A one-unit increase in attitude raises points by, on average, 0.35. The adjusted R-squared tells us that attitude can explain about 19% of the variation in points.
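As a sanity check of the R-squared interpretation, it can be recomputed by hand as one minus the ratio of residual to total sum of squares. A sketch using the built-in mtcars data as a stand-in, since the course data set is not bundled with R:

```r
# Fit a simple regression on built-in data
fit <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(fit)$r.squared
# R-squared = 1 - SS_residual / SS_total
r2_manual <- 1 - sum(residuals(fit)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(r2, r2_manual)  # TRUE
```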
par(mfrow = c(2,2))
plot(fit2, which = c(1,2,5))
A regression analysis assumes that the relationship between the explanatory variables and the dependent variable is linear. It also assumes that the errors are normally distributed, have a constant variance, and that the size of a given error does not depend on the explanatory variables. The normal Q-Q plot suggests the errors are roughly normally distributed: if the points fall close to a straight line, this assumption holds. Residuals vs leverage shows that no single observation affects the model too much, as no point lies beyond the dashed Cook's distance lines. The residuals vs fitted plot reveals any systematic pattern in the residuals, which could appear if the relationship between the explanatory variables and the dependent variable is not linear, or if the error variance is not constant. In our case it looks quite acceptable, although there are some fairly large residuals at fitted values 24-26.
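These assumption checks can also be run numerically alongside the plots; a sketch with mtcars again as a stand-in data set:

```r
fit <- lm(mpg ~ wt, data = mtcars)
# Formal normality check of the residuals: a large p-value means
# normality is at least plausible
shapiro.test(residuals(fit))
# The same Q-Q plot that plot(fit, which = 2) would draw
qqnorm(residuals(fit)); qqline(residuals(fit))
# Cook's distances directly, instead of reading them off the plot
sort(cooks.distance(fit), decreasing = TRUE)[1:3]
```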
getwd()
## [1] "/home/chpatola/IODS-project"
imported_file <-read.csv(file="data/allStudents.csv")
colnames(imported_file)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(imported_file)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The data has 382 rows (observations) and 35 columns (variables). The file contains information about students in two Portuguese schools, including family background, grades in mathematics and Portuguese, alcohol use, and study habits. More information is available here: [link](https://archive.ics.uci.edu/ml/datasets/Student+Performance)
The four variables that I chose to study for a connection to alcohol use are: age (numeric: from 15 to 22), address (home address type: binary: 'U' - urban or 'R' - rural), absences (number of school absences: numeric: from 0 to 93), and final grade (numeric: from 0 to 20).
Below are my hypotheses for each variable's connection with alcohol consumption. Age: the older a student, the higher the probability of drinking. Address: there could be less to do in the countryside, so students might drink more there; it could also be that students in the city drink more because it is easier to get hold of alcohol. Absences: students who drink a lot might do so due to problems at home and/or at school, which could mean they have more absences than others. Final grade: see the explanation above.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Bar plot distributions for each value
gather(imported_file) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Boxplots
g1 <- ggplot(imported_file, aes(x = high_use, y = absences, col = address))
g2 <- ggplot(imported_file, aes(x = high_use, y = age))
g3 <- ggplot(imported_file, aes(x = high_use, y = G3))
g1 + geom_boxplot() + ylab("Amount of absences") + xlab("High alcohol use")
g2 + geom_boxplot() + ylab("Student age") + xlab("High alcohol use")
g3 + geom_boxplot() + ylab("Final grade") + xlab("High alcohol use")
#Barplot address
counts <- table(imported_file$address,imported_file$high_use)
barplot(counts, main="Students' alcohol use per address type, urban - rural",
xlab="High alcohol use", col=c("darkblue","blue"),
legend = rownames(counts))
cor(imported_file$G3, imported_file$high_use)
## [1] -0.1284528
cor(imported_file$absences, imported_file$high_use)
## [1] 0.2233679
Looking at the graphs, my hypothesis about absences seems reasonable. Concerning age, students under 16 do not seem to drink much, while more students over 17 drink heavily. Final grade has a slight negative correlation with high alcohol use. The distribution of heavy drinkers seems to be about the same in the countryside and in the city.
m <- glm(high_use ~ address + absences + G3 + age, data = imported_file, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ address + absences + G3 + age, family = "binomial",
## data = imported_file)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4810 -0.8301 -0.6844 1.1992 1.9067
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.28205 1.82767 -1.249 0.211807
## addressU -0.36549 0.27512 -1.329 0.184012
## absences 0.08360 0.02303 3.629 0.000284 ***
## G3 -0.06054 0.03551 -1.705 0.088195 .
## age 0.11978 0.10149 1.180 0.237918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 439.30 on 377 degrees of freedom
## AIC: 449.3
##
## Number of Fisher Scoring iterations: 4
#Odd ratios
odd_ratio <- coef(m) %>% exp()
#Confidence intervals
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odd_ratio,conf_int)
## odd_ratio 2.5 % 97.5 %
## (Intercept) 0.1020743 0.002753262 3.628802
## addressU 0.6938539 0.406425388 1.198479
## absences 1.0871915 1.041208391 1.139729
## G3 0.9412543 0.877558570 1.009031
## age 1.1272469 0.924299016 1.377414
AIC (a measure of the information lost by the model) is 449.3.
When I look at the summary of the fitted model, I see that only absences has a significant (positive) relationship with high use of alcohol.
A confidence interval that includes 1 implies that there is no real difference in the output (alcohol use) depending on the value of the input variable. As one can see, only the confidence interval for absences excludes 1.
For a given age, G3 score and address type, the odds of a student being a heavy drinker increase by approximately 4-14% per unit increase in absences.
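The odds-ratio reading of a logistic coefficient can be checked by hand: the exponentiated coefficient is the multiplicative change in the odds per unit increase. A sketch on simulated data (the variable names here are made up to mirror the analysis, not the course data):

```r
set.seed(1)
# Simulate a predictor and a binary outcome whose log-odds depend on it
absences <- rpois(200, 4)
high_use <- rbinom(200, 1, plogis(-1 + 0.1 * absences))
m <- glm(high_use ~ absences, family = "binomial")
# Multiplicative change in odds per extra absence
or <- exp(coef(m)["absences"])
# Wald confidence interval on the odds-ratio scale; excluding 1
# corresponds to a significant effect
ci <- exp(confint.default(m)["absences", ])
or
ci
```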
model <- glm(high_use ~ absences, data = imported_file, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = imported_file)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3754 -0.8181 -0.7282 1.2652 1.7475
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28190 0.15920 -8.052 8.12e-16 ***
## absences 0.08982 0.02299 3.907 9.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 447.45 on 380 degrees of freedom
## AIC: 451.45
##
## Number of Fisher Scoring iterations: 4
imported_file$probabilities <- predict(model, type = "response")
imported_file$prediction <- imported_file$probabilities > 0.5
head(imported_file, 2)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## nursery internet guardian traveltime studytime failures schoolsup famsup
## 1 yes no mother 2 2 0 yes no
## 2 no yes father 1 2 0 no yes
## paid activities higher romantic famrel freetime goout Dalc Walc health
## 1 no no yes no 4 3 4 1 1 3
## 2 no no yes no 5 3 3 1 1 3
## absences G1 G2 G3 alc_use high_use probabilities prediction
## 1 5 2 8 8 1 FALSE 0.3030510 FALSE
## 2 3 7 8 8 1 FALSE 0.2665014 FALSE
truth <- table(Using_much_alcohol = imported_file$high_use, Predicted_high_user = imported_file$prediction)
total_wrong <- truth["FALSE", "TRUE"] + truth["TRUE", "FALSE"]
training_error <- total_wrong / nrow(imported_file)
Using information on absences only, I get a logistic regression model that predicts drinking status with an error rate of 29%. Its AIC is 451.45, a little higher than for the model with all four variables. Compared to what one could achieve by simply guessing, with a 50% chance of a correct guess, it is somewhat better. Not a great model, but better than none. However, the model is much better at identifying those who do not drink heavily than those who do. If we wanted to find students who drink heavily (perhaps to support them in some way), the model would hardly be of any help at all, as it - with this data - identified only 9 out of 114 heavy drinkers. ***
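The comparison against guessing can be made concrete with a small misclassification helper; a sketch on simulated labels (the helper name `loss_func` is my own, not from the course data):

```r
# Share of observations where the predicted probability lands on the
# wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
set.seed(42)
truth <- rbinom(100, 1, 0.3)          # simulated 0/1 outcomes
guess <- runif(100)                   # pure guessing: random "probabilities"
loss_func(truth, guess)               # hovers around 0.5 on average
loss_func(truth, rep(0, 100))         # always predicting FALSE: error = mean(truth)
```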
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Boston,2)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.9
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The data has 506 rows (observations) and 14 columns (variables); all variables are numeric. The file contains information about housing values in the suburbs of Boston. More information is available here: [link](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Boston_cor <-cor(Boston)
library(corrplot)
## corrplot 0.84 loaded
corrplot(Boston_cor, method="circle")
library(purrr)
library(tidyr)
library(ggplot2)
Boston %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
par(mar=c(1,1,1,1))
pairs(Boston)
The strongest positive correlation is between tax and rad, and the strongest negative correlations are between dis and nox, dis and indus, dis and age, as well as medv and lstat. Looking at the descriptions of the variables (see link above), these relationships make sense.
Concerning the distributions of the values: age and black are skewed towards higher values; chas, tax and rad have two groups of values; rm and medv are roughly normally distributed; and dis and crim are skewed towards low values.
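The strongest pairwise correlation can also be extracted programmatically instead of read off the corrplot; a small sketch:

```r
library(MASS)
cm <- cor(Boston)
# Blank out the upper triangle and diagonal so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA
strongest <- max(abs(cm), na.rm = TRUE)
# Row/column indices of the variable pair with the largest |r|
which(abs(cm) == strongest, arr.ind = TRUE)
```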
#Standardize boston dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
#Correct class back to dataframe
class(boston_scaled)#matrix
## [1] "matrix"
class(Boston)#data frame
## [1] "data.frame"
boston_scaled_df <- as.data.frame(boston_scaled)
#Categorical variable of crime
bins <- quantile(boston_scaled_df$crim)
crime <- cut(boston_scaled_df$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime) # We now have groups with roughly the same number of observations in each
## crime
## low med_low med_high high
## 127 126 126 127
#Remove original crim from the dataset
boston_scaled_df <- dplyr::select(boston_scaled_df, -crim)
# add the new categorical value to scaled data
boston_scaled_df <- data.frame(boston_scaled_df, crime)
colnames(boston_scaled_df)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
head(boston_scaled_df,3)
## zn indus chas nox rm age
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## dis rad tax ptratio black lstat
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## medv crime
## 1 0.1595278 low
## 2 -0.1014239 low
## 3 1.3229375 low
#Divide dataset into train and test
n <- nrow(boston_scaled_df)
ind <- sample(n, size = n * 0.8)
train_boston <- boston_scaled_df[ind,]
test_boston <- boston_scaled_df[-ind,]
# save the correct classes from test data
correct_classes <- test_boston$crime
# remove the crime variable from test data
test_boston <- dplyr::select(test_boston, -crime)
When we standardized the dataset, the mean of each variable became 0 and the standard deviation 1. If we were to use the data to predict a dependent variable, the effect of each explanatory variable would be comparable and not larger for those with a larger scale (for example 0-100 compared to 0-5).
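What `scale()` does can be reproduced by hand: subtract each column's mean and divide by its standard deviation. A quick sketch on a toy vector:

```r
x <- c(10, 20, 30, 40)
# Manual standardization: center, then divide by the standard deviation
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.numeric(scale(x)))  # TRUE
```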
lda.boston <- lda(crime ~., data = train_boston)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train_boston$crime)
# plot the lda results
plot(lda.boston, dimen = 2, col= classes, pch = classes)
lda.arrows(lda.boston, myscale = 4)
predictions <- predict(lda.boston, newdata = test_boston)
table(correct = correct_classes, predicted = predictions$class)
## predicted
## correct low med_low med_high high
## low 10 14 1 0
## med_low 2 25 4 0
## med_high 2 8 11 1
## high 0 0 1 23
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
confusionMatrix(predictions$class, correct_classes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low med_low med_high high
## low 10 2 2 0
## med_low 14 25 8 0
## med_high 1 4 11 1
## high 0 0 1 23
##
## Overall Statistics
##
## Accuracy : 0.6765
## 95% CI : (0.5766, 0.7658)
## No Information Rate : 0.3039
## P-Value [Acc > NIR] : 1.067e-14
##
## Kappa : 0.5598
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: low Class: med_low Class: med_high Class: high
## Sensitivity 0.40000 0.8065 0.5000 0.9583
## Specificity 0.94805 0.6901 0.9250 0.9872
## Pos Pred Value 0.71429 0.5319 0.6471 0.9583
## Neg Pred Value 0.82955 0.8909 0.8706 0.9872
## Prevalence 0.24510 0.3039 0.2157 0.2353
## Detection Rate 0.09804 0.2451 0.1078 0.2255
## Detection Prevalence 0.13725 0.4608 0.1667 0.2353
## Balanced Accuracy 0.67403 0.7483 0.7125 0.9728
When we compare the real crime classes to the ones the LDA model predicted, we see that it did quite well. The overall accuracy is about 68%.
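The overall accuracy is simply the share of the diagonal in the cross-tabulation of correct vs. predicted classes; a sketch with a hypothetical two-class example:

```r
# Made-up predictions and true classes
predicted <- factor(c("a", "a", "b", "b", "b"))
correct <- factor(c("a", "b", "b", "b", "a"))
tab <- table(correct, predicted)
# Diagonal entries are the correctly classified observations: 3 of 5 here
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
```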
We already standardized the Boston dataset in task 4.
#Calculated distances between suburbs
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Run k-means algorithm on the dataset
km <-kmeans(boston_scaled, centers = 3)
pairs(boston_scaled, col = km$cluster)
#Investigate what is the optimal number of clusters
set.seed(11)
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
#Run k means again with different number of clusters
km <-kmeans(boston_scaled, centers = 6)
pairs(boston_scaled, col = km$cluster)
The distances between suburbs (with respect to all standardized variables) range from a minimum of 0.1343 to a maximum of 14.3970, with a mean of 4.9111.
Looking at the graph, it seems that the more clusters, the better. However, too many clusters defeat the purpose of dividing the data into pieces; in the extreme case, we would have one cluster for each data point. The downward slope flattens out quite a lot after approximately 6-7 clusters, so we could probably choose that as the optimal number.
With 6 clusters, we have divided the housing suburbs into 6 groups, where suburbs are more similar within a group than they are to suburbs in other groups. ***
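The within-cluster sum of squares used in the elbow plot is one part of a fixed decomposition: the total sum of squares always splits into within- plus between-cluster parts, which is why adding clusters shifts variance from within to between. A sketch re-running k-means on the scaled data:

```r
library(MASS)
set.seed(11)
bs <- scale(Boston)
km3 <- kmeans(bs, centers = 3)
km6 <- kmeans(bs, centers = 6)
# Within-cluster SS typically shrinks as the number of clusters grows
c(k3 = km3$tot.withinss, k6 = km6$tot.withinss)
# The decomposition holds exactly for any k
all.equal(km3$totss, km3$tot.withinss + km3$betweenss)  # TRUE
```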
humanDev <- read.csv("data/human.csv")
head(humanDev, 2)
## X fem_male_ed fem_male_job years_ed lifeexp gni_num mat_mort
## 1 Norway 1.0072389 0.8908297 17.5 81.6 64992 4
## 2 Australia 0.9968288 0.8189415 20.2 82.4 42261 6
## ad_birth_rate parliament
## 1 7.8 39.6
## 2 12.1 30.5
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
humanDev_numeric <- humanDev[,-1]
head(humanDev_numeric,2)
## fem_male_ed fem_male_job years_ed lifeexp gni_num mat_mort ad_birth_rate
## 1 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## 2 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## parliament
## 1 39.6
## 2 30.5
ggpairs(humanDev_numeric)
Looking at the distributions of the variables and the relationships between them, many of the variables are not connected with each other, but years_ed and lifeexp seem to have a positive relationship with a correlation of 0.79. Also ad_birth_rate and mat_mort seem related, with a correlation of 0.76. lifeexp and ad_birth_rate have a negative relationship with a correlation of -0.73, and mat_mort and years_ed also have a negative relationship, with a correlation of -0.74. ad_birth_rate is also negatively correlated with years_ed (-0.7).
Concerning the distributions, only years_ed seems normally distributed. fem_male_ed, fem_male_job and parliament also come close to a normal distribution, but the other variables are skewed.
pca_human <- prcomp(humanDev_numeric)
summary(pca_human) #Almost all variation can be explained by the first PC
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
biplot(pca_human, choices = 1:2, col=c("black","blue"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
#Standardize dataset
humanDev_numeric_stand <- scale(humanDev_numeric)
summary(humanDev_numeric_stand)
## fem_male_ed fem_male_job years_ed lifeexp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## gni_num mat_mort ad_birth_rate parliament
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_stand <- prcomp(humanDev_numeric_stand)
summary(pca_human_stand) #The variation is explained much more evenly across the components than with the unstandardized dataset
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_human_stand, choices = 1:2, col=c("black","blue"))
When we standardized the dataset, the mean of each variable became 0 and the standard deviation 1. In the unstandardized dataset, variables with more variation due to a broader range of values (like gni_num) are given much more weight in the PCA. When we standardize the variables, differences in scale no longer matter, and we get a more balanced picture of how the different variables contribute to the variation in the model. The difference between the standardized and unstandardized data is clear in the biplots and model summaries: in the unstandardized version, gni_num explains most of the variation, and as it loads on PC1, PC1 accounts for as much as >99% of the variation. In the standardized version, we need all 8 PCs to reach the same cumulative percentage.
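This scale-dependence of PCA can be reproduced on any data with mixed units; a sketch using the built-in mtcars data as a stand-in (disp is on a much larger scale than the other columns):

```r
# Proportion of variance explained by each component
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)
pca_raw <- prcomp(mtcars)                 # no standardization
pca_scaled <- prcomp(mtcars, scale. = TRUE)  # same as scaling the data first
round(var_share(pca_raw)[1], 3)     # PC1 dominates on raw data
round(var_share(pca_scaled)[1], 3)  # much more balanced after standardizing
```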
My interpretation of the biplot of components 1 and 2 from the standardized data: PC1 and PC2 together explain approximately 70% of the variation in the model. When deciding how many components to use, there are different approaches. One is to keep components with a standard deviation of at least 0.7 (here that would mean PC1-PC4), but one can also, for example, use graphical methods. If we stick to PC1 and PC2 and focus on their biplot, we can say that fem_male_job and parliament are correlated and mainly contribute to PC2. mat_mort and ad_birth_rate are also correlated and mainly contribute to PC1. In general, a small angle between feature arrows indicates that they are correlated, and the PC they are parallel with indicates which component they contribute to. The length of an arrow corresponds to the feature's standard deviation; in our model there are no large differences between the standard deviations of the features.
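The "keep components with a standard deviation of at least 0.7" rule mentioned above can be applied programmatically; a sketch on standardized mtcars as stand-in data:

```r
pca <- prcomp(mtcars, scale. = TRUE)
# Components retained under the sd >= 0.7 rule
keep <- which(pca$sdev >= 0.7)
length(keep)
# Cumulative share of variance covered by the retained components
cumsum(pca$sdev^2 / sum(pca$sdev^2))[max(keep)]
```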
library(FactoMineR)
library(dplyr)
library(tidyr)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
chosen_tea <- tea[,c(3,7,14,15)]
str(chosen_tea)
## 'data.frame': 300 obs. of 4 variables:
## $ evening: Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
mca_model <- MCA(chosen_tea)
summary(mca_model)
##
## Call:
## MCA(X = chosen_tea)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.298 0.275 0.254 0.245 0.238 0.190
## % of var. 19.860 18.304 16.933 16.364 15.878 12.661
## Cumulative % of var. 19.860 38.164 55.097 71.461 87.339 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.112 0.014 0.023 | -0.075 0.007 0.010 | 0.142 0.026
## 2 | -0.601 0.404 0.275 | 0.589 0.421 0.264 | -0.142 0.026
## 3 | -0.104 0.012 0.013 | -0.457 0.254 0.245 | -0.535 0.375
## 4 | 0.112 0.014 0.023 | -0.075 0.007 0.010 | 0.142 0.026
## 5 | -0.104 0.012 0.013 | -0.457 0.254 0.245 | -0.535 0.375
## 6 | -0.557 0.347 0.613 | -0.351 0.149 0.243 | 0.027 0.001
## 7 | -0.557 0.347 0.613 | -0.351 0.149 0.243 | 0.027 0.001
## 8 | -0.148 0.024 0.013 | 0.482 0.282 0.140 | -0.703 0.649
## 9 | -0.601 0.404 0.275 | 0.589 0.421 0.264 | -0.142 0.026
## 10 | -0.104 0.012 0.013 | -0.457 0.254 0.245 | -0.535 0.375
## cos2
## 1 0.037 |
## 2 0.015 |
## 3 0.335 |
## 4 0.037 |
## 5 0.335 |
## 6 0.001 |
## 7 0.001 |
## 8 0.298 |
## 9 0.015 |
## 10 0.335 |
##
## Categories
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## evening | 0.650 12.161 0.221 8.123 | -0.146 0.671 0.011
## Not.evening | -0.340 6.358 0.221 -8.123 | 0.077 0.351 0.011
## home | -0.077 0.486 0.193 -7.599 | 0.090 0.721 0.264
## Not.home | 2.499 15.720 0.193 7.599 | -2.922 23.324 0.264
## alone | -0.094 0.478 0.016 -2.205 | -0.623 22.949 0.720
## lemon | 1.276 15.020 0.201 7.754 | 0.846 7.160 0.088
## milk | -0.189 0.631 0.010 -1.687 | 1.347 34.669 0.482
## other | -1.325 4.421 0.054 -4.030 | 0.966 2.549 0.029
## No.sugar | -0.706 21.618 0.533 -12.624 | -0.280 3.676 0.084
## sugar | 0.755 23.108 0.533 12.624 | 0.299 3.929 0.084
## v.test Dim.3 ctr cos2 v.test
## evening -1.832 | -0.743 18.650 0.289 -9.288 |
## Not.evening 1.832 | 0.388 9.751 0.289 9.288 |
## home 8.886 | -0.068 0.444 0.150 -6.707 |
## Not.home -8.886 | 2.205 14.362 0.150 6.707 |
## alone -14.674 | -0.154 1.523 0.044 -3.635 |
## lemon 5.140 | 0.833 7.515 0.086 5.065 |
## milk 12.004 | -0.494 5.038 0.065 -4.401 |
## other 2.938 | 3.744 41.388 0.433 11.385 |
## No.sugar -4.997 | -0.112 0.642 0.013 -2.009 |
## sugar 4.997 | 0.120 0.686 0.013 2.009 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## evening | 0.221 0.011 0.289 |
## home | 0.193 0.264 0.150 |
## How | 0.245 0.739 0.563 |
## sugar | 0.533 0.084 0.013 |
plot(mca_model, invisible=c("ind"))
The tea dataset contains 300 rows and 36 columns. In the MCA factor map, the category Not.home stands apart from the rest of the categories. Probably because if you are not at home (for example at work), you are less likely to sit down with a good cup of tea. The rest of the categories seem related to each other. Interestingly, among them, lemon and No.sugar are furthest from each other: if you have lemon in your tea, you might be more likely to also have sugar in it.
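A side note on the column selection above: picking variables by name rather than by position (`tea[,c(3,7,14,15)]`) is less fragile if the column order of the data ever changes. A small base R illustration with a made-up data frame:

```r
# Selecting columns by name instead of by position
df <- data.frame(a = 1:3, b = letters[1:3], c = LETTERS[1:3])
by_position <- df[, c(1, 3)]      # breaks silently if columns are reordered
by_name     <- df[, c("a", "c")]  # robust to reordering
identical(by_position, by_name)
```

With the tea data, the same pattern would be `tea[, c("evening", "home", "How", "sugar")]`.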
#Importing data and inspecting it
library(dplyr)
library(tidyr)
#rats
rats <- read.csv("data/rats.csv")
summary(rats)
## ID Group time gram
## Min. : 1.00 Min. :1.00 WD1 :16 Min. :225.0
## 1st Qu.: 4.75 1st Qu.:1.00 WD15 :16 1st Qu.:267.0
## Median : 8.50 Median :1.50 WD22 :16 Median :344.5
## Mean : 8.50 Mean :1.75 WD29 :16 Mean :384.5
## 3rd Qu.:12.25 3rd Qu.:2.25 WD36 :16 3rd Qu.:511.2
## Max. :16.00 Max. :3.00 WD43 :16 Max. :628.0
## (Other):80
str(rats)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gram : int 240 225 245 260 255 260 275 245 410 405 ...
rats$ID <- as.factor(rats$ID)
rats$Group <- as.factor(rats$Group)
str(rats)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gram : int 240 225 245 260 255 260 275 245 410 405 ...
#Convert time to a numeric variable
rats$time <- as.character(rats$time)
rats$time <- as.numeric(gsub("WD","",rats$time))
str(rats)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ time : num 1 1 1 1 1 1 1 1 1 1 ...
## $ gram : int 240 225 245 260 255 260 275 245 410 405 ...
summary(rats)
## ID Group time gram
## 1 : 11 1:88 Min. : 1.00 Min. :225.0
## 2 : 11 2:44 1st Qu.:15.00 1st Qu.:267.0
## 3 : 11 3:44 Median :36.00 Median :344.5
## 4 : 11 Mean :33.55 Mean :384.5
## 5 : 11 3rd Qu.:50.00 3rd Qu.:511.2
## 6 : 11 Max. :64.00 Max. :628.0
## (Other):110
#Plot the weight development of the individual rats
library(ggplot2)
ggplot(rats, aes(x = time, y = gram, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(rats$gram), max(rats$gram)))
#Standardize the weights to see the weight development even more clearly
rats$gram_sd <- scale(rats$gram)
summary(rats)
## ID Group time gram gram_sd.V1
## 1 : 11 1:88 Min. : 1.00 Min. :225.0 Min. :-1.2541867
## 2 : 11 2:44 1st Qu.:15.00 1st Qu.:267.0 1st Qu.:-0.9238953
## 3 : 11 3:44 Median :36.00 Median :344.5 Median :-0.3144291
## 4 : 11 Mean :33.55 Mean :384.5 Mean : 0.0000000
## 5 : 11 3rd Qu.:50.00 3rd Qu.:511.2 3rd Qu.: 0.9969062
## 6 : 11 Max. :64.00 Max. :628.0 Max. : 1.9150375
## (Other):110
#Plot the standardized curves
ggplot(rats, aes(x = time, y = gram_sd, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(rats$gram_sd), max(rats$gram_sd)))
#Look at a summary variable: the mean weight for the different groups
#Create a new table with the mean weight per individual
rats_summative<- rats %>%
filter(time > 0) %>%
group_by(Group, ID) %>%
summarise( mean_gram=mean(gram) ) %>%
ungroup()
str(rats_summative)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16 obs. of 3 variables:
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean_gram: num 261 238 260 267 269 ...
#Visualize the mean weights per group
ggplot(rats_summative, aes(x = Group, y = mean_gram)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=2, fill = "black") +
scale_y_continuous(name = "mean(gram)")
#The outlier we saw earlier in group 2 is very visible in this boxplot. It could be wise to remove it
#We filter it out and redraw the plot
rats_summative_no_outlier <- rats_summative %>% filter(mean_gram <570)
ggplot(rats_summative_no_outlier, aes(x = Group, y = mean_gram)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=2, fill = "black") +
scale_y_continuous(name = "mean(gram)")
#Check with an ANOVA test that the group means really are different -> they clearly are!
oneway.test(mean_gram~Group,data=rats_summative_no_outlier,var.equal=TRUE)
##
## One-way analysis of means
##
## data: mean_gram and Group
## F = 483.6, num df = 2, denom df = 12, p-value = 3.387e-12
When we plot the weight changes of the rats at the individual level, we see a general tendency of increasing weight. Group 1 differs from groups 2 and 3: the individuals belonging to it are much lighter than those in the other two, both at the start and at the end of the time period. Group 2 also has one individual whose weight, throughout all the weeks, is considerably higher than that of the rest of the rats.
If we standardize the weights and redraw the plot, the patterns described above become even clearer.
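As a side note, `scale()` returns a one-column matrix rather than a plain vector, which is why the summary above shows the column heading `gram_sd.V1`. A small base R illustration:

```r
# scale() standardizes to mean 0, sd 1, but returns a matrix with attributes
x <- c(225, 267, 344, 511, 628)
s <- scale(x)
class(s)                        # a matrix, not a numeric vector
x_std <- as.numeric(scale(x))   # wrap in as.numeric() for a plain vector
mean(x_std); sd(x_std)          # mean 0 (up to rounding), sd 1
```

Wrapping the call in `as.numeric()` when creating the column would avoid the `.V1` suffix in later summaries.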
To further familiarize ourselves with the data, we can look at it on a summary level. In the boxplot, we look at the means - and general distributions - of the different groups. Are they different? The boxplot clearly indicates so, and also that the variation within groups is quite small, but we can verify this with an ANOVA test. The p-value of the test clearly shows that the means of the 3 different groups are highly unlikely to be the same.
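The one-way ANOVA above only says that some group means differ. If one wanted to see which pairs of groups differ, a Tukey HSD follow-up is a common next step. A hedged, self-contained sketch with made-up stand-in data (not the rat weights):

```r
# Hypothetical mini-example: follow a significant one-way ANOVA with
# pairwise Tukey HSD comparisons
set.seed(42)
d <- data.frame(g = factor(rep(c("1", "2", "3"), each = 5)),
                y = c(rnorm(5, 260), rnorm(5, 450), rnorm(5, 520)))
fit <- aov(y ~ g, data = d)
TukeyHSD(fit)  # confidence interval for each pairwise group difference
```

Applied to `rats_summative_no_outlier`, this would show whether each pair of diet groups differs, not only that they differ overall.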
#Importing data and inspecting it
#bprs
bprs <- read.csv("data/bprs.csv")
bprs$treatment <- as.factor(bprs$treatment)
bprs$subject <- as.factor(bprs$subject)
str(bprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs_score: int 42 58 54 55 72 48 71 30 41 57 ...
#Extract week number from week strings
bprs$weeks <- as.character(bprs$weeks)
str(bprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs_score: int 42 58 54 55 72 48 71 30 41 57 ...
bprs$week_int <- as.numeric(gsub("week","",bprs$weeks))
summary(bprs) #week from 0 (baseline) to 8
## treatment subject weeks bprs_score week_int
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
str(bprs)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs_score: int 42 58 54 55 72 48 71 30 41 57 ...
## $ week_int : num 0 0 0 0 0 0 0 0 0 0 ...
head(bprs,3)
## treatment subject weeks bprs_score week_int
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
#We start our analysis by pretending the observations are independent (ie, we ignore that they are repeated measurements on the same subjects over time) and fit an ordinary linear regression
dim(bprs)#360 observations on 5 variables
## [1] 360 5
bprs_reg <- lm(bprs_score ~ week_int + treatment, data = bprs)
summary(bprs_reg)
##
## Call:
## lm(formula = bprs_score ~ week_int + treatment, data = bprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week_int -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
#Plot to look for signs that the bprs score depends on treatment -> possibly, but difficult to say
library(ggplot2)
library(GGally)
ggpairs(bprs[,c(1,4:5)])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Random intercept model
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
bprs_rim <- lmer(bprs_score ~ week_int + treatment + (1 | subject), data = bprs, REML = FALSE)
summary(bprs_rim)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs_score ~ week_int + treatment + (1 | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week_int -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) wek_nt
## week_int -0.437
## treatment2 -0.282 0.000
#Random Intercept and Random Slope Model
bprs_rirsm <-lmer(bprs_score ~ week_int + treatment + (week_int | subject), data = bprs, REML = FALSE)
summary(bprs_rirsm)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs_score ~ week_int + treatment + (week_int | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week_int 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week_int -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) wek_nt
## week_int -0.582
## treatment2 -0.247 0.000
#ANOVA analysis to see the difference between the two random effects models
anova(bprs_rirsm, bprs_rim )
## Data: bprs
## Models:
## bprs_rim: bprs_score ~ week_int + treatment + (1 | subject)
## bprs_rirsm: bprs_score ~ week_int + treatment + (week_int | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bprs_rim 5 2748.7 2768.1 -1369.4 2738.7
## bprs_rirsm 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Random Intercept and Random Slope Model with interaction
bprs_rirsmwi <- lmer(bprs_score ~ week_int * treatment + (week_int | subject), data = bprs, REML = FALSE)
summary(bprs_rirsmwi)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs_score ~ week_int * treatment + (week_int | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week_int 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week_int -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week_int:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) wek_nt trtmn2
## week_int -0.650
## treatment2 -0.424 0.469
## wk_nt:trtm2 0.356 -0.559 -0.840
#ANOVA analysis to see the difference between the last two models
anova(bprs_rirsmwi,bprs_rirsm)
## Data: bprs
## Models:
## bprs_rirsm: bprs_score ~ week_int + treatment + (week_int | subject)
## bprs_rirsmwi: bprs_score ~ week_int * treatment + (week_int | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bprs_rirsm 7 2745.4 2772.6 -1365.7 2731.4
## bprs_rirsmwi 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we start analysing our data by first fitting an ordinary linear regression model (see comments embedded in the code above), we can see that the time factor is significant, with an average decrease of 2.3 points per week. The treatment factor is not significant in this model.
Next, we use the random intercept model. It introduces a constant random "error term" connected to each subject. This may arise because, for example, a subject consistently exaggerates the effects measured - most commonly by reporting higher or lower values on a questionnaire than another subject with the same symptoms would have done. In the model summary, we see that the random intercept across subjects has a standard deviation of 6.9, which can be considered quite high. Looking at the fixed effects, "week" is still significant, with an average decrease of 2.3 points per week. The treatment term is still insignificant.
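The subject-specific intercepts that this model estimates can be inspected directly with `ranef()`. A self-contained illustration using lme4's bundled sleepstudy data (a stand-in, not the BPRS data):

```r
# Random intercept model: one intercept shift per subject
library(lme4)
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)
head(ranef(m)$Subject)  # each subject's estimated deviation from the common intercept
```

On the BPRS fit above, `ranef(bprs_rim)$subject` would show the corresponding per-subject shifts whose standard deviation is the 6.9 reported in the summary.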
The random intercept and random slope model tested next allows each subject's slope - and not only intercept - to differ. This means that not only is a random intercept added to or subtracted from each individual's predicted score, but the effect of a change in time also varies per individual. In the summary of this model, we see that the standard deviation of the subject random intercept is even higher than in the previous model, 8.1, while the standard deviation of the week slope is only 1.0. Looking at the fixed effects, one more week still decreases the score by 2.3 points on average. The significance of week and treatment (yes and no, respectively) is the same as in the 2 earlier models.
An ANOVA analysis of the two random intercept models shows that the second model is to be preferred over the first.
Our final model is the random intercept and random slope model with interaction. It allows for an interaction between treatment and time, not only between the individual and these. In the output from this model, we see random effects similar to those of the previous model: the standard deviation for the subject intercept is still 8.1 and for week 1.0. In the fixed effects, "week" is still significant, with an average decrease of 2.7 points per week. The treatment term is still insignificant.
The ANOVA analysis of the two last models shows that the third model fits slightly better than the second, although the improvement is only borderline significant (p = 0.075).
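The likelihood ratio comparison used in both ANOVA steps can be reproduced in a self-contained form on lme4's sleepstudy data (again a stand-in for BPRS): fit the nested models with maximum likelihood (`REML = FALSE`, as above) and compare them with `anova()`.

```r
# Likelihood ratio test: random intercept vs random intercept + random slope
library(lme4)
m0 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)
m1 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)
anova(m0, m1)  # chi-squared test on 2 df; a low p-value favours the richer model
```

For sleepstudy the random slope is strongly supported; in our BPRS comparison the interaction model was only weakly supported (p = 0.075).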